Abstract: In a distributed storage system based on erasure
coding, an important problem is the repair problem: If a node
storing a coded piece fails, in order to maintain the same level
of reliability, we need to create a new encoded piece and store it
at a new node. This paper presents a construction of systematic
(n, k)-MDS codes for 2k < n that achieves the minimum repair
bandwidth when repairing from k + 1 nodes.



I. Introduction 

It is well known that erasure coding can be used to effec- 
tively provide reliability against node failures in a data storage 
system. For instance, we can divide a file of size B into k 
pieces, each of size B/k, encode them into n coded pieces 
using an (n, k) maximum distance separable (MDS) code, and 
store them at n nodes. Then, the original file can be recovered 
from any set of k coded pieces. This is optimal in terms of 
the redundancy-reliability tradeoff because k pieces, each of 
size B/k, provide the minimum data for recovering the file, 
which is of size B. 

One of the challenges for erasure coding-based distributed 
storage is the repair problem (introduced in [1]): If a node 
storing a coded piece fails or leaves the system, in order to 
maintain the same level of reliability, we need to create a new 
encoded piece and store it at a new node. If the source file is 
not available in the system (e.g., in an archival application), 
then the repair has to be done by accessing other encoded 
data only. A straightforward way to repair a failed node in
a system based on an (n, k)-MDS code is to let the new node
download k encoded pieces from a subset of the surviving
nodes, reconstruct the original file, and compute the needed
new coded piece. In this process, the new node incurs a
network traffic of k × B/k = B. Since network bandwidth
could be a critical resource in distributed storage systems, 
an important consideration is to conserve the repair network 
bandwidth. 

The repair problem amounts to the partial recovery of the 
code, whereas conventional erasure code design focused on 
the complete recovery of the information from a subset of the 
coded pieces. The consideration of the repair network traffic 
gives rise to new design challenges for erasure codes. This 
problem and its variants have been studied in recent years 
and various code constructions have been proposed. Next we 
briefly review the related existing work on the construction of 
erasure codes with reduced repair bandwidth. 

In this paper, we focus on (n, k)-MDS codes, because they
achieve the optimal reliability-storage tradeoff. Via a cut-based
analysis, Dimakis et al. [1] presented a lower bound on
the network bandwidth needed to repair one node in an (n, k)-MDS
code. Under a symmetric setup where the replacement
node downloads the same number of bits from each of d
nodes, it is shown that the total repair traffic has to be at
least $\frac{Bd}{k(d-k+1)}$. The same bound for total repair traffic in fact
also holds even if we relax the symmetric setup; this will be
explained in Section II.

Yunnan Wu is with Microsoft Research, One Microsoft Way, Redmond,
WA, 98052. yunnanwu@microsoft.com.

The cut lower bound on total repair traffic has been shown
in [1]-[4] to be achievable using network coding, if we adopt a
relaxed notion of repair, functional repair, where the repaired
code continues to be (n, k)-MDS but may be different
from the original code before the repair. However, it is not
clear that this network coding scheme can be made always 
systematic (i.e., one copy of the data exists in uncoded form). 
From a practical standpoint, it is highly desirable to have the 
systematic feature, so that in normal cases, data can be read 
directly from the uncoded copy, without performing decoding. 

Motivated in part by the pursuit of a systematic code with 
reduced repair bandwidth, in [5], Wu and Dimakis formulated 
a variant of the repair problem, called the exact repair prob- 
lem, where the same code is always maintained before and 
after the repair. For the exact repair problem, [5] presented 
an interference alignment scheme and a vector version of it. 
The interference alignment scheme can achieve the cut bound
$\frac{Bd}{k(d-k+1)}$ for (n, 2)-MDS codes, and the resulting code is systematic.
However, the scheme cannot achieve the cut bound for general
k.

Functional repair and exact repair are not the only possible 
models. In a recent work, Rashmi K.V. et al. [6] proposed 
a code construction that can achieve the cut bound for d = 
k + 1. The construction of [6] essentially implements a hybrid 
functional and exact repair model. In the scheme, each node 
stores 2 symbols, $y^T u_i$ and $y^T v_i + z^T u_i$, where the 2k
original information symbols are represented by two vectors
$y \in F^k$ and $z \in F^k$. The vectors $\{u_i\}$ can be chosen as
the n code vectors of an (n, k)-MDS code. If node i fails, the
first symbol $y^T u_i$ is exactly reconstructed; the second symbol
$y^T v_i + z^T u_i$ is repaired to a new symbol that has the same
form, $y^T v'_i + z^T u_i$. Since $\{u_i\}$ can be chosen based on any
(n, k)-MDS code, we can in particular use a systematic (n, k)-MDS
code. Thus, the code can expose half of the information
symbols, y, in uncoded form.

Having explained several repair models, we now reflect 
on the practical needs again. Both the MDS feature and the 
systematic feature are highly desirable in practice. However, 
providing the systematic feature does not necessarily require 
all symbols be exactly reconstructed. This motivates us to
explore one avenue: look for a systematic MDS code with a
hybrid functional and exact repair model, where the systematic
symbols are exactly reconstructed and the nonsystematic symbols
follow a functional repair model. Heading in this direction,
in this paper we present a construction of (n, k)-MDS codes
for 2k < n that achieves the minimum repair bandwidth when
repairing from k + 1 nodes.



II. Review: Cutset Bound on Total Repair Traffic 

In this section we describe the cut bound for total repair 
traffic. The analysis amounts to a slight extension of the 
analysis in [1], [2]. Specifically, in [1], [2] the replacement 
node downloads the same number of bits from each of d nodes; 
in the following Lemma 1, the replacement node is allowed to
download any number of bits from each of the d nodes. The
same bound on total network traffic still holds. 

Lemma 1: Consider B bits being stored via an (n, k)-MDS
code at n nodes, where each node stores $\alpha = B/k$ bits. To
repair any failed storage node by accessing d > k nodes, the
total incurred network traffic is at least

$$\frac{Bd}{k(d-k+1)}.$$

Fig. 1. Illustration of the proof of Lemma 1.

Proof: As in [1], [2], we consider the information flow 
graph that describes the repair problem as a network com- 
munication problem. The information flow graph is illustrated 
in Figure 1. In this graph, each storage node is represented
by a pair of nodes, say $\mathrm{in}_i$ and $\mathrm{out}_i$, connected by an edge
whose capacity is $\alpha$, the storage capacity of the node. There
is a source node, s, which has the entire file. The source has
infinite capacity edges to the n storage nodes before the repair.
In Figure 1, storage node 1 fails and we create a new storage
node, node 5, which downloads $\beta_i$ bits from each of the three
surviving nodes i = 2, 3, 4 and then stores $\alpha$ bits; this is represented in
Figure 1 by the edges $\mathrm{out}_2\mathrm{in}_5$, $\mathrm{out}_3\mathrm{in}_5$, and $\mathrm{out}_4\mathrm{in}_5$ that enter
node 5. There are also data collectors, each corresponding to
one request to reconstruct the original data from a subset of
the nodes. For example, the data collector t in Figure 1 has
infinite capacity edges from nodes 2 and 5, modeling that the 
file needs to be reconstructed by accessing storage nodes 2 and 
5. By analyzing the cut between s and the data collectors in 
the information flow graph, we can obtain bounds on the repair 
traffic. In particular, if the minimum cut between s and a data 
collector t is less than the size of the file, then we can conclude 
that it is impossible to reconstruct the file, regardless of what 
code we use. In the following, we use this cut argument to 
establish a bound on the total network traffic. 

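Fig. 2. Illustration of the proposed scheme.

Without loss of generality, suppose the first storage node
fails and node n + 1 recovers the content stored at node 1 by
downloading $\beta_i$ bits from node i for i = 2, ..., d + 1.
Consider a data collector t that connects to node n + 1 and a
set P of k − 1 other nodes in {2, ..., d + 1}. Consider an s-t
cut $(U, \bar{U})$ with

$$U = \{t, \mathrm{in}_{n+1}, \mathrm{out}_{n+1}\} \cup \{\mathrm{out}_i : i \in P\}. \tag{1}$$

This is illustrated by Figure 1. Then we obtain a bound by
requiring that the capacity of the cut is at least B:

$$(k-1)\alpha + \sum_{i \in \{2,\dots,d+1\} \setminus P} \beta_i \ge B. \tag{2}$$

For each (k − 1)-subset P ⊂ {2, ..., d + 1}, we can obtain
one inequality like (2). There are $\binom{d}{k-1}$ such subsets, and each
index i is outside P in $\binom{d-1}{k-1}$ of them. Summing up all these
inequalities, we have that:

$$\binom{d}{k-1}(k-1)\alpha + \binom{d-1}{k-1}\sum_{i=2}^{d+1}\beta_i \ge \binom{d}{k-1} B.$$

Thus, using $\alpha = B/k$,

$$\sum_{i=2}^{d+1}\beta_i \ge \frac{Bd}{k(d-k+1)}.$$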
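As a quick numerical illustration of Lemma 1, the following minimal sketch evaluates the bound with exact rational arithmetic. The function name `cut_bound` is hypothetical and not from the paper, and the sketch also accepts d = k as an assumption, since the formula then reduces to the straightforward repair cost B.

```python
from fractions import Fraction


def cut_bound(B, k, d):
    """Lower bound B*d / (k*(d-k+1)) on the total repair traffic (Lemma 1).

    B: file size, k: number of pieces needed to recover the file,
    d: number of nodes contacted during the repair.
    Assumption of this sketch: d >= k >= 1 is accepted.
    """
    if k < 1 or d < k:
        raise ValueError("need d >= k >= 1")
    return Fraction(B * d, k * (d - k + 1))


if __name__ == "__main__":
    B, k = 12, 3
    # Contacting more nodes lowers the bound on the total repair traffic.
    for d in range(k, 8):
        print(d, cut_bound(B, k, d))
```

For d = k the bound equals B, the traffic of the straightforward repair described in the Introduction; for d = k + 1 it equals B(k + 1)/(2k).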



III. The Code Construction 

The proposed scheme is illustrated in Figure 2. Let F denote
the finite field over which the code is defined. In Figure 2,
$x \in F^{2k}$ is a vector consisting of the 2k original information
symbols. Each node i stores 2 symbols, $x^T u_i$ and $x^T v_i$. The
vectors $\{u_i\}$ do not change over time. The vectors $\{v_i\}$
change over time as the code repairs. We maintain the
invariant property that the 2n length-2k vectors $\{u_i, v_i\}$ form
a (2n, 2k)-MDS code; that is, any 2k vectors in the set
$\{u_i, v_i\}$ have full rank 2k. This certainly implies that the n
nodes form an (n, k)-MDS code. We initialize the code using
any (2n, 2k) systematic MDS code over F.
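The invariant above can be checked by brute force for small parameters. The sketch below is an illustration under stated assumptions, not the paper's construction: it uses a prime field GF(257) and Vandermonde vectors (1, a, a^2, ...) with distinct a, which are known to form an MDS code but are not systematic. A systematic code can be obtained from it by a change of basis, which is not shown here. All function names are hypothetical.

```python
from itertools import combinations

P = 257  # hypothetical prime field size, chosen only for illustration


def rank_mod_p(rows, p=P):
    """Rank over GF(p) of a matrix given as a list of rows (Gaussian elimination)."""
    m = [[x % p for x in r] for r in rows]
    rank = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        piv = next((r for r in range(rank, len(m)) if m[r][c]), None)
        if piv is None:
            continue
        m[rank], m[piv] = m[piv], m[rank]
        inv = pow(m[rank][c], -1, p)  # modular inverse (Python 3.8+)
        m[rank] = [x * inv % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][c]:
                f = m[r][c]
                m[r] = [(a - f * b) % p for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def vandermonde_code(num_vectors, dim, p=P):
    """Code vectors (1, a, ..., a^(dim-1)) for the distinct points a = 1..num_vectors."""
    assert num_vectors < p
    return [[pow(a, j, p) for j in range(dim)] for a in range(1, num_vectors + 1)]


def is_mds(vectors, dim, p=P):
    """True if every choice of `dim` vectors has full rank `dim`."""
    return all(rank_mod_p(list(sub), p) == dim for sub in combinations(vectors, dim))


k, n = 2, 5
code = vandermonde_code(2 * n, 2 * k)  # 2n vectors of length 2k
u, v = code[:n], code[n:]              # node i stores x^T u[i] and x^T v[i]
print(is_mds(u + v, 2 * k))
```

Because any 2k of the 2n vectors are linearly independent, any k nodes (holding 2k vectors in total) suffice to recover x, which is the (n, k)-MDS property stated above.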

Now we consider the situation of a repair. Without loss of
generality, suppose node n failed and is repaired by accessing
nodes 1, ..., k + 1. As illustrated in Figure 2, the replacement
node downloads $\alpha_i x^T u_i + \beta_i x^T v_i$ from each node i of
{1, ..., k + 1}. Using these k + 1 downloaded symbols, the
replacement node computes two symbols $x^T u_n$ and $x^T v'_n$ as
follows:

$$\sum_{i=1}^{k+1} \left(\alpha_i x^T u_i + \beta_i x^T v_i\right) = x^T u_n \tag{3}$$

$$\sum_{i=1}^{k+1} \rho_i \left(\alpha_i x^T u_i + \beta_i x^T v_i\right) = x^T v'_n \tag{4}$$
Note that $v'_n$ is allowed to be different from $v_n$; the property
that we maintain is that the repaired code continues to be a
(2n, 2k)-MDS code. Here $\{\alpha_i, \beta_i, \rho_i\}$ and $v'_n$ are the variables
that we can control. The following theorem shows that we can
choose these variables so that (3) and (4) are satisfied and the
repaired code continues to be a (2n, 2k)-MDS code.

Theorem 1: Let F be a finite field whose size is greater
than

$$d_0 = \binom{2n}{2k}. \tag{5}$$

Suppose the old code specified by $\{u_i, v_i\}$ is a (2n, 2k)-MDS
code defined over F. When node n fails, there exists an
assignment of the variables $\{\alpha_i, \beta_i, \rho_i\}$ such that (3) and (4)
are satisfied and the repaired code continues to be a (2n, 2k)-MDS
code.

Proof: We begin by examining the condition (3). Introduce

$$\eta = [\alpha_1, \beta_1, \dots, \alpha_{k+1}, \beta_{k+1}]^T \tag{6}$$

$$A = [u_1, v_1, \dots, u_{k+1}, v_{k+1}] \tag{7}$$

Let $\eta_i$ denote the i-th entry of $\eta$; let $A_i$ denote the i-th column
of A. Then the condition (3) can be equivalently written in
matrix form as:

$$A\eta = u_n \tag{8}$$

Suppose two arbitrary entries of $\eta$, say $\eta_i$ and $\eta_j$, are fixed at
arbitrary given values. Let $\eta_{\setminus\{i,j\}}$ denote the subvector of $\eta$
after removing the i-th and j-th entries and $A_{\setminus\{i,j\}}$ denote the
submatrix of A after removing the i-th and j-th columns. Since
$u_1, \dots, u_n, v_1, \dots, v_n$ are code vectors of a (2n, 2k) MDS
code, any 2k columns of A have full rank 2k; in particular,
$A_{\setminus\{i,j\}}$ is invertible. Then, to satisfy $A\eta = u_n$ with given $\eta_i$
and $\eta_j$, $\eta_{\setminus\{i,j\}}$ is uniquely determined as

$$\eta_{\setminus\{i,j\}} = A_{\setminus\{i,j\}}^{-1}\left(u_n - \eta_i A_i - \eta_j A_j\right). \tag{9}$$

Thus, the solutions to (8) have two degrees of freedom, with
exactly $|F|^2$ solutions. Given any two entries of $\eta$, say $\eta_i$ and
$\eta_j$, there is a unique solution to (3) and the other entries are
affine functions of $\eta_i$ and $\eta_j$. In particular, we can consider
$\eta_1 = \alpha_1$, $\eta_2 = \beta_1$ as the two free parameters after considering
(3).
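The two degrees of freedom can be made concrete with a small computation. The following sketch fixes the two free parameters (the first two entries of the coefficient vector) and solves for the remaining 2k entries by inverting the square submatrix, as in the argument above. It assumes a prime field GF(257) and Vandermonde code vectors purely for illustration; the helper names are hypothetical.

```python
P = 257  # hypothetical prime field size, chosen only for illustration


def solve_mod_p(M, b, p=P):
    """Solve the square system M x = b over GF(p); M is assumed invertible."""
    n = len(M)
    aug = [[x % p for x in row] + [b[i] % p] for i, row in enumerate(M)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c])
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = pow(aug[c][c], -1, p)  # modular inverse (Python 3.8+)
        aug[c] = [x * inv % p for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [(x - f * y) % p for x, y in zip(aug[r], aug[c])]
    return [aug[r][n] for r in range(n)]


def solve_eta(cols, u_n, alpha1, beta1, p=P):
    """Return eta with A eta = u_n, where A has columns `cols` and the first
    two entries of eta are fixed to alpha1 and beta1."""
    dim = len(u_n)
    rhs = [(u_n[r] - alpha1 * cols[0][r] - beta1 * cols[1][r]) % p
           for r in range(dim)]
    rest = cols[2:]  # the 2k remaining columns form an invertible matrix
    M = [[col[r] for col in rest] for r in range(dim)]
    return [alpha1 % p, beta1 % p] + solve_mod_p(M, rhs, p)


def apply(cols, eta, p=P):
    """Compute A eta over GF(p), with A given by its columns."""
    return [sum(e * c[r] for e, c in zip(eta, cols)) % p
            for r in range(len(cols[0]))]


k, n = 2, 5
vecs = [[pow(a, j, P) for j in range(2 * k)] for a in range(1, 2 * n + 1)]
u, v = vecs[:n], vecs[n:]
cols = [w for i in range(k + 1) for w in (u[i], v[i])]  # u_1, v_1, ..., u_{k+1}, v_{k+1}
eta = solve_eta(cols, u[-1], alpha1=3, beta1=7)
assert apply(cols, eta) == u[-1]
```

Each choice of the pair (alpha1, beta1) yields exactly one solution, so over GF(p) there are p^2 solutions in total.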

After considering (3), we are left with k + 3 degrees of freedom
that we can tune. Let the variables $\{\alpha_1, \beta_1, \rho_1, \dots, \rho_{k+1}\}$
be collectively represented by a vector $\xi$ with k + 3 entries in
F. Next we examine (4). From (4), $v'_n$ is determined as:

$$v'_n = \sum_{i=1}^{k+1} \rho_i \left(\alpha_i u_i + \beta_i v_i\right). \tag{10}$$

It remains to prove that we can choose $\xi \in F^{k+3}$ such that
the new code $\{u_1, \dots, u_n, v_1, \dots, v_{n-1}, v'_n\}$ continues to be
a (2n, 2k)-MDS code.

Since the old code $\{u_1, \dots, u_n, v_1, \dots, v_n\}$ was a (2n, 2k)-MDS
code, we just need to prove that $v'_n$ can be made
linearly independent of any (2k − 1)-subset of
$U = \{u_1, \dots, u_n, v_1, \dots, v_{n-1}\}$. For any (2k − 1)-subset S of
{1, ..., 2n − 1}, let $U_S$ denote the 2k × (2k − 1) matrix
whose columns are given by the vectors in U indexed by S.
Then the (2n, 2k)-MDS condition boils down to:

$$\prod_{S \subseteq \{1,\dots,2n-1\},\ |S| = 2k-1} \det\left([U_S, v'_n]\right) \ne 0. \tag{11}$$

From (10) and the discussion above, we see that each entry of
$v'_n$ is a multivariate polynomial in $\xi$. This implies that the left
hand side of (11) can be viewed as a multivariate polynomial
in $\xi$; it can be shown that the total degree of this polynomial
is at most $d_0$.

Claim 1: For any $S \subseteq \{1, \dots, 2n-1\}$ with |S| = 2k − 1,
$\det([U_S, v'_n]) \ne 0$ for some $\xi \in F^{k+3}$.

Proof of Claim: The replacement node downloads one
symbol each from nodes 1, ..., k + 1. Each node i of 1, ..., k + 1
stores a pair of symbols $x^T u_i$ and $x^T v_i$. The matrix $U_S$
has 2k − 1 columns; we also view it as a set of 2k − 1 column
vectors. Since nodes 1, ..., k + 1 hold 2k + 2 code vectors in total,
there must exist a node, say $i^*$, in 1, ..., k + 1,
satisfying either $u_{i^*} \notin U_S$ or $v_{i^*} \notin U_S$ or both.

Suppose $v_{i^*} \notin U_S$ for $i^* \in \{1, \dots, k+1\}$. From the
discussion earlier, given any two entries of $\eta$, there is a unique
solution to (3). In particular, we can let $\alpha_{i^*} = 0$ and $\beta_{i^*} = 1$;
this maps uniquely to one assignment of $\alpha_1$ and $\beta_1$, according
to (9). We further choose $\rho_{i^*} = 1$ and all other $\rho_i = 0$. With
this choice of $\xi$, $v'_n = v_{i^*}$. Since $v_{i^*} \notin U_S$ and the old code
$\{u_1, v_1, \dots, u_n, v_n\}$ was a (2n, 2k)-MDS code,

$$\det\left([U_S, v'_n]\right) = \det\left([U_S, v_{i^*}]\right) \ne 0 \tag{12}$$

for this choice of $\xi$.

The case $u_{i^*} \notin U_S$ follows similarly.

Claim 1 implies that $\det([U_S, v'_n])$ is a nonzero multivariate
polynomial in $\xi$, which further implies that the left hand
side of (11) is a nonzero multivariate polynomial in $\xi$. From
the Schwartz-Zippel Theorem (quoted below as Lemma 2), for
a finite field whose size is $|F| > d_0$, there exists an assignment
of $\xi \in F^{k+3}$ such that (11) holds. Thus the theorem follows. ■

Lemma 2 (Schwartz-Zippel Theorem (see, e.g., [7])):
Let $Q(x_1, \dots, x_n) \in F[x_1, \dots, x_n]$ be a multivariate polynomial
of total degree $d_0$ (the total degree is the maximum
degree of the additive terms and the degree of a term is the
sum of exponents of the variables). Fix any finite set $\mathbb{S} \subseteq F$,
and let $r_1, \dots, r_n$ be chosen independently and uniformly at
random from $\mathbb{S}$. Then if $Q(x_1, \dots, x_n)$ is not equal to a zero
polynomial,

$$\Pr[Q(r_1, \dots, r_n) = 0] \le \frac{d_0}{|\mathbb{S}|}. \tag{13}$$



Corollary 1 (A Systematic (n, k)-MDS Code):

The above scheme gives a construction of systematic (n, k)-MDS
codes for 2k < n that achieves the minimum repair
bandwidth when repairing from k + 1 nodes.
Proof: Consider n > 2k. Note that in the above scheme,
we can initialize the code $\{u_1, \dots, u_n, v_1, \dots, v_n\}$ with
any (2n, 2k)-MDS code. In particular, we can use a systematic
code and assign the 2k systematic code vectors to
$\{u_1, \dots, u_{2k}\}$. Since $\{u_1, \dots, u_n\}$ do not change over time,
the code remains a systematic (2n, 2k)-MDS code. Thus the n
nodes form a systematic (n, k)-MDS code. The code repairs a
failure by downloading k + 1 symbols from d = k + 1 nodes,
where the total file size is B = 2k. This achieves the cut bound
given in Lemma 1. ■
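To spell out the last step, the bound of Lemma 1 can be evaluated at B = 2k and d = k + 1, with traffic measured in field symbols:

```latex
\frac{Bd}{k(d-k+1)}
  \;=\; \frac{2k\,(k+1)}{k\,\bigl((k+1)-k+1\bigr)}
  \;=\; \frac{2k\,(k+1)}{2k}
  \;=\; k+1 .
```

The k + 1 downloaded symbols therefore meet the bound with equality, whereas the straightforward repair described in the Introduction would download B = 2k symbols.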



A. Code Construction Algorithm 

From the proof of Theorem 1 and the Schwartz-Zippel Theorem,
for a sufficiently large finite field F, if we independently
and uniformly draw each entry of $\xi$, then (11) will hold with
high probability. This can be used to develop a randomized
code construction procedure.

We initialize the code using any (2n, 2k) systematic MDS
code over F. Subsequently, for each repair, we randomly draw
a vector $\xi$ from $F^{k+3}$. For each drawing of $\xi$, we compute
the resulting $v'_n$ and check whether it is linearly independent
from the $\binom{2n-1}{2k-1}$ subsets of $\{u_1, \dots, u_n, v_1, \dots, v_{n-1}\}$
with cardinality 2k − 1. The random drawing process can be
repeated until the desired property is met.
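The randomized procedure can be sketched for toy parameters as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it works over a hypothetical prime field GF(257), starts from a non-systematic Vandermonde MDS code for brevity, always repairs the last node from the first k + 1 nodes, and checks the MDS property by brute force. All function names are hypothetical.

```python
import random
from itertools import combinations

P = 257  # hypothetical prime field size, assumed large enough for this toy example


def solve_mod_p(M, b, p=P):
    """Solve the square system M x = b over GF(p); return None if M is singular."""
    n = len(M)
    aug = [[x % p for x in row] + [b[i] % p] for i, row in enumerate(M)]
    for c in range(n):
        piv = next((r for r in range(c, n) if aug[r][c]), None)
        if piv is None:
            return None
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = pow(aug[c][c], -1, p)  # modular inverse (Python 3.8+)
        aug[c] = [x * inv % p for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [(x - f * y) % p for x, y in zip(aug[r], aug[c])]
    return [aug[r][n] for r in range(n)]


def independent(vectors, p=P):
    """True if `dim` vectors of length `dim` are linearly independent over GF(p)."""
    dim = len(vectors[0])
    M = [[w[r] for w in vectors] for r in range(dim)]
    return solve_mod_p(M, [0] * dim, p) is not None


def repair_last_node(u, v, k, rng, p=P):
    """Repair node n from nodes 1..k+1 and return the new vector v'_n.

    u[-1] is reproduced exactly by construction (A eta = u_n); only v[-1] changes.
    """
    dim = 2 * k
    cols = [w for i in range(k + 1) for w in (u[i], v[i])]  # columns of A
    M = [[c[r] for c in cols[2:]] for r in range(dim)]      # A without its first two columns
    others = u + v[:-1]                                      # the 2n-1 vectors that are kept
    while True:
        # Draw xi = (alpha_1, beta_1, rho_1, ..., rho_{k+1}) uniformly at random.
        a1, b1 = rng.randrange(p), rng.randrange(p)
        rho = [rng.randrange(p) for _ in range(k + 1)]
        rhs = [(u[-1][r] - a1 * cols[0][r] - b1 * cols[1][r]) % p for r in range(dim)]
        eta = [a1, b1] + solve_mod_p(M, rhs, p)
        v_new = [sum(rho[i] * (eta[2 * i] * u[i][r] + eta[2 * i + 1] * v[i][r])
                     for i in range(k + 1)) % p for r in range(dim)]
        # Accept only if v_new is independent of every (2k-1)-subset of the others.
        if all(independent(list(S) + [v_new], p)
               for S in combinations(others, dim - 1)):
            return v_new


k, n = 2, 5
vecs = [[pow(a, j, P) for j in range(2 * k)] for a in range(1, 2 * n + 1)]
u, v = vecs[:n], vecs[n:]
v[-1] = repair_last_node(u, v, k, random.Random(0))
```

The brute-force check visits every (2k − 1)-subset, so this sketch is practical only for small n and k; each rejected draw simply triggers another random draw, as described above.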

B. Structural Comparison with Other Schemes 

The above code scheme has a simple structure. It starts with
any given (2n, 2k)-MDS code, with n code vectors exactly
maintained. The other n code vectors evolve over time as the
code repairs. The invariant property is that the code is always
a (2n, 2k)-MDS code (hence an (n, k)-MDS code). We now
compare the structure of this code with other existing schemes.

Since the proposed scheme works for d = k + 1, we
only consider the case d = k + 1 in the comparison. For
d = k + 1, all schemes to be discussed below store two symbols
at each node, which are linear combinations of the 2k original
information symbols. Thus all these schemes can be expressed
in the same notation, where x denotes the 2k information
symbols and node i stores $x^T u_i$ and $x^T v_i$. However, the
schemes differ in the additional structural properties imposed on
the code and also in how the repair is done.

First, the network coding scheme for the functional repair
model [1]-[4] achieves the cut bound on total repair bandwidth.
The code has a looser structure compared to the code
proposed in this paper. In each repair, the two symbols can
be repaired to two new symbols. The only requirement is that
the 2n code vectors always form an (n, k)-MDS code. However, in
doing so, it is hard, if not impossible, to provide the systematic
feature.

Second, the interference alignment scheme for the exact
repair model [5] achieves the cut bound on total repair
bandwidth for k = 2 but not for general k. The code is
formed by two rows of (n, k)-MDS codes¹, each involving half of the
variables. More precisely, let the original information symbols
x be split into two vectors y and z, each of length k. Then the
first row of the code is a systematic (n, k)-MDS code applied
to y and the second row is a different systematic (n, k)-MDS
code applied to z. The requirement is that each row is (n, k)-MDS
and a certain interference cancellation condition must be
met. In this code, the first k nodes store the 2k systematic
symbols; in comparison, in the proposed code, the systematic
symbols are spread in the first row, across the nodes.

¹ The code is viewed as a 2 × n matrix (see Figure 2), where the columns
correspond to the n nodes.

Third, the scheme of [6] achieves the cut bound on total
repair bandwidth. In this scheme, the 2k original information
symbols are represented by two vectors y and z, each of length
k. Each node stores $y^T u_i$ and $y^T v_i + z^T u_i$. The first row of
the code, given by the vectors $\{u_i\}$, does not change over time
and can be any (n, k)-MDS code. The second row of the
code is essentially the same code $\{u_i\}$ applied to z plus a
linear function of y. Here the vectors $\{v_i\}$ change over time
as the code repairs; they can assume arbitrary values.